class: center, middle, inverse, title-slide

# Panel data models
## Tutorial 2
### Stanislav Avdeev

---

# Goal for today's tutorial

1. Understand the panel structure of the data
1. Explore differences between pooled OLS, fixed and random effects estimators
1. Interpret the variation in the data
1. Make proper inferences using panel data models

---

# Panel data

- Panel data is when you observe the same individual over multiple time periods
  - "individual" could be a person, a company, a state, a country, etc. There are `\(N\)` individuals in the panel data
  - "time period" could be a year, a month, a day, etc. There are `\(T\)` time periods in the data
- We assume that we observe each individual the same number of times, i.e. a *balanced* panel (so we have `\(N\times T\)` observations)
  - you can use these estimators with unbalanced panels too, it just gets a little more complex

---

# Panel data

- Let's use a dataset from `wooldridge` package on crime data
  - you can use a lot of datasets from different packages, such as `wooldridge` which contains datasets from "Introductory Econometrics: A Modern Approach" by Wooldridge J.M.
- Here's what a panel data set looks like
  - a variable for individual (county), a variable for time (year), and then the data

<table> <thead> <tr> <th style="text-align:right;"> County </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> Crime Rate </th> <th style="text-align:right;"> Prob. of Arrest </th> </tr> </thead> <tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 81 </td> <td style="text-align:right;"> 0.0398849 </td> <td style="text-align:right;"> 0.289696 </td> </tr>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 82 </td> <td style="text-align:right;"> 0.0383449 </td> <td style="text-align:right;"> 0.338111 </td> </tr>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 0.0303048 </td> <td style="text-align:right;"> 0.330449 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 81 </td> <td style="text-align:right;"> 0.0163921 </td> <td style="text-align:right;"> 0.202899 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 82 </td> <td style="text-align:right;"> 0.0190651 </td> <td style="text-align:right;"> 0.162218 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 0.0151492 </td> <td style="text-align:right;"> 0.181586 </td> </tr>
</tbody> <tfoot> <tr> <td style = 'padding: 0; border:0;' colspan='100%'><sup></sup> 6 rows out of 630. "Prob.
of Arrest" is estimated probability of being arrested when you commit a crime</td> </tr> </tfoot> </table>

---

# Between and within variation

Let's pick a few counties and graph this out

<!-- -->

---

# Between variation

If we look at the **between** variation by using the **pooled** OLS estimator, we get this

<!-- -->

---

# Between variation

**Between** variation looks at the relationship **between the means of each county**

<!-- -->

---

# Between variation

The individual year-to-year variation within county doesn't matter

<!-- -->

---

# Within variation

**Within** variation goes the other way: it looks at variation **within county from year-to-year**

<!-- -->

---

# Between and within variation

- We can clearly see that **between counties** there's a strong **positive** relationship
- But if you look **within** a given county, the relationship isn't that strong, and actually seems to be **negative**
  - which would make sense - if you think your chances of getting arrested are high, that should be a deterrent to crime
  - we are ignoring all differences between counties and looking only at differences within counties
- **Fixed effects** is sometimes also referred to as the **within** estimator

---

# Panel data model

- The `\(it\)` subscript says this variable varies over individual `\(i\)` and time `\(t\)`

`\begin{align*} Y_{it} = \alpha + X_{it}' \beta + U_{it} \end{align*}`

- What if there are individual-level components in the error term causing omitted variable bias?
  - `\(X_{it}\)` might be related to a variable that is not in the model and is thus part of the error term
- So we really have this then:

`\begin{align*} Y_{it} = \alpha + X_{it}' \beta + \eta_i + U_{it} \end{align*}`

- If you think `\(X_{it}\)` and `\(\eta_i\)` are **not** correlated (based on theory, previous research), you can use both FE and RE estimators
- If you think `\(X_{it}\)` and `\(\eta_i\)` are correlated (based on theory, previous research), use the FE estimator

---

# Panel data model: simulation

- Let's simulate a panel dataset

```r
library(tibble) # provides tibble()

set.seed(7)
df <- tibble(id = sort(rep(1:600, 10)),
             time = rep(1:10, 600),
             x1 = rnorm(6000),
             # fixed variable within individual, e.g. gender
             x2 = ifelse(id %% 2 == 0, 1, 0),
             y = id + time + 2*x1 + 50*x2 + rnorm(6000))
```

<table> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> time </th> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> y </th> </tr> </thead> <tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2.2872472 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 6.4522242 </td> </tr>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> -1.1967717 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.3436617 </td> </tr>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> -0.6942925 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 2.3772643 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.3569862 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 53.0468280 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> <td
style="text-align:right;"> 2.7167518 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 59.0964204 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2.2814519 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 60.9361494 </td> </tr>
</tbody> </table>

---

# Panel data model: simulation

```r
# The true effect of x1 is 2
library(plm) # package to estimate FE and RE models
pooled <- plm(y ~ x1 + x2, model = "pooling", df) # or lm(y ~ x1 + x2, df)
fixed <- plm(y ~ x1 + x2, model = "within", index = c("id", "time"), df)
random <- plm(y ~ x1 + x2, model = "random", index = c("id", "time"), df)
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Pooled OLS </th> <th style="text-align:center;"> FE </th> <th style="text-align:center;"> RE </th> </tr> </thead> <tbody>
<tr> <td style="text-align:left;"> x1 </td> <td style="text-align:center;"> 1.278 </td> <td style="text-align:center;"> 1.900*** </td> <td style="text-align:center;"> 1.900*** </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (2.235) </td> <td style="text-align:center;"> (0.043) </td> <td style="text-align:center;"> (0.043) </td> </tr>
<tr> <td style="text-align:left;"> x2 </td> <td style="text-align:center;"> 51.049*** </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> 51.033*** </td> </tr>
<tr> <td style="text-align:left;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (4.474) </td> <td style="text-align:center;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (14.176) </td> </tr>
<tr> <td style="text-align:left;"> Num.Obs.
</td> <td style="text-align:center;"> 6000 </td> <td style="text-align:center;"> 6000 </td> <td style="text-align:center;"> 6000 </td> </tr>
</tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>

- The pooled OLS estimate of `x1` is far from the true value of 2 and very imprecise (standard error of 2.235), as it doesn't take into account the panel structure of the data
- FE and RE estimators provide unbiased estimates, both close to 2
- FE estimator doesn't produce an estimate for `\(X_2\)` as it's not varying **within** individual

---

# Panel data model: simulation

- Let's introduce correlation between the individual effects and the individual characteristics

`\begin{align*} \text{cov} (X_i, \eta_i) \neq 0 \end{align*}`

```r
set.seed(7)
df <- tibble(id = sort(rep(1:600, 10)),
             time = rep(1:10, 600),
             # add a correlated individual effect in x1
             x1 = rnorm(6000) + 0.1*id,
             x2 = ifelse(id %% 2 == 0, 1, 0),
             y = id + time + 2*x1 + 50*x2 + rnorm(6000))
```

---

# Panel data model: simulation

```r
# The true effect of x1 is 2
pooled_corr <- plm(y ~ x1 + x2, model = "pooling", df)
fixed_corr <- plm(y ~ x1 + x2, model = "within", index = c("id", "time"), df)
random_corr <- plm(y ~ x1 + x2, model = "random", index = c("id", "time"), df)
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Pooled OLS </th> <th style="text-align:center;"> FE </th> <th style="text-align:center;"> RE </th> </tr> </thead> <tbody>
<tr> <td style="text-align:left;"> x1 </td> <td style="text-align:center;"> 11.969*** </td> <td style="text-align:center;"> 1.900*** </td> <td style="text-align:center;"> 11.720*** </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.008) </td> <td style="text-align:center;"> (0.043) </td> <td style="text-align:center;"> (0.023) </td> </tr>
<tr> <td style="text-align:left;"> x2 </td> <td
style="text-align:center;"> 49.768*** </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> 49.799*** </td> </tr>
<tr> <td style="text-align:left;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (0.272) </td> <td style="text-align:center;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (0.791) </td> </tr>
<tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 6000 </td> <td style="text-align:center;"> 6000 </td> <td style="text-align:center;"> 6000 </td> </tr>
</tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>

- Pooled OLS and RE estimates are off since `\(\text{cov} (X_i, \eta_i) \neq 0\)`
- FE still provides unbiased estimates since `\(\eta_i\)` are eliminated
- How does the FE estimator eliminate `\(\eta_i\)`?

---

# Estimation: de-meaning approach

- To estimate the FE model, we need to remove between variation so that all that's left is within variation
- There are two main ways
  - **de-meaning**
  - **binary variables**
- They give the same result (for balanced panels anyway)
- Let's do de-meaning first, since it's most closely and obviously related to the "removing between variation" explanation
  - for each variable `\(X_{it}\)`, `\(Y_{it}\)`, etc., get the mean value of that variable for each individual `\(\bar{X}_i, \bar{Y}_i\)`
  - subtract out that mean to get residuals `\((X_{it} - \bar{X}_i), (Y_{it} - \bar{Y}_i)\)`
  - work with those residuals
- `\(\alpha\)` and `\(\eta_i\)` terms get absorbed, since both are constant within an individual and drop out when the individual mean is subtracted
- The residuals are, by construction, no longer related to the `\(\eta_i\)`

`$$Y_{it} - \bar{Y}_i = (X_{it} - \bar{X}_i)' \beta + (U_{it} - \bar{U}_i)$$`

---

# Estimation: LSDV approach

- De-meaning the data is not the only way to do it
  - and sometimes it can make the standard errors wonky, since a regression on de-meaned data doesn't recognize that you've estimated those means (the degrees of freedom are not adjusted)
- You can also use the
**least squares dummy variable** (LSDV) method ("dummy variable" is another word for "binary variable")
  - we just treat "individual" like the categorical variable it is and add it as a control

---

# Estimation: empirical example

- Let's get back to the crime dataset
- To demean the data, we can use `group_by` to get means-within-groups and subtract them out

```r
library(dplyr) # provides %>%, filter(), group_by(), mutate()

data(crime4, package = 'wooldridge')
crime4 <- crime4 %>%
  # filter to the data points from our graph
  filter(county %in% c(1, 3, 7, 23), prbarr < .5) %>%
  group_by(county) %>%
  # county-level means of both variables
  mutate(mean_crime = mean(crmrte),
         mean_prob = mean(prbarr)) %>%
  # deviations from the county means
  mutate(demeaned_crime = crmrte - mean_crime,
         demeaned_prob = prbarr - mean_prob)
```

---

# Estimation: empirical example

- To use least squares dummy variable, we only need to add FE as categorical variables

```r
pooling <- lm(crmrte ~ prbarr, data = crime4)
lsdv <- lm(crmrte ~ prbarr + factor(county), data = crime4)
de_mean <- lm(demeaned_crime ~ demeaned_prob, data = crime4)
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Pooled OLS </th> <th style="text-align:center;"> LSDV </th> <th style="text-align:center;"> De-meaned </th> </tr> </thead> <tbody>
<tr> <td style="text-align:left;"> prbarr </td> <td style="text-align:center;"> 0.049** </td> <td style="text-align:center;"> −0.030* </td> <td style="text-align:center;"> </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.017) </td> <td style="text-align:center;"> (0.012) </td> <td style="text-align:center;"> </td> </tr>
<tr> <td style="text-align:left;"> demeaned_prob </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.030* </td> </tr>
<tr> <td style="text-align:left;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px">
</td> <td style="text-align:center;box-shadow: 0px 1px"> (0.012) </td> </tr>
<tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 27 </td> <td style="text-align:center;"> 27 </td> <td style="text-align:center;"> 27 </td> </tr>
</tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>

---

# Interpreting a within relationship

- How can we interpret that slope of -0.03?
  - this is all **within variation** so our interpretation must be **within county**
  - if we think we've **causally** identified it, "raising the arrest probability by `\(1\)` percentage point in a county reduces the number of crimes per person in that county by `\(0.0003\)`"
  - we're basically **controlling for county**, i.e. comparing a county to itself at a different point in time
- A benefit of the LSDV approach is that it calculates the fixed effects `\(\eta_i\)` for you
  - interpretation is exactly the same as with a categorical variable
  - we have an omitted category (one county), and these show the difference relative to that omitted county
  - this also makes clear another element of what's happening.
Just like with a categorical variable, the line is moving up and down to meet the counties
  - graphically, de-meaning moves all the points together in the middle to draw a line, while LSDV moves the line up and down to meet the points

---

# Interpreting a within relationship

<!-- -->

---

# Panel data: estimation

- Applied researchers rarely do either of these, and rather will use a command specifically designed for the FE estimator
  - `feols` in `fixest`
  - `felm` in `lfe`
  - `plm` in `plm`
  - `lm_robust` in `estimatr`
- `feols` in `fixest` seems to be a better choice
  - it does all sorts of other neat stuff like fixed effects in nonlinear models like logit, regression tables, joint-test functions, and so on
  - it’s very fast, and can be easily adjusted to do fixed effects with other regression methods like logit, or combined with instrumental variables
  - it clusters the standard errors by the first fixed effect by default

---

# Panel data: estimation

Let's look at the output of `feols` next to that of `plm`

```r
library(fixest)
fe_plm <- plm(crmrte ~ prbarr, model = "within", index = "county", crime4)
fe_feols <- feols(crmrte ~ prbarr | county, crime4)
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> plm </th> <th style="text-align:center;"> feols </th> </tr> </thead> <tbody>
<tr> <td style="text-align:left;"> prbarr </td> <td style="text-align:center;"> −0.030* </td> <td style="text-align:center;"> −0.030* </td> </tr>
<tr> <td style="text-align:left;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (0.012) </td> <td style="text-align:center;box-shadow: 0px 1px"> (0.006) </td> </tr>
<tr> <td style="text-align:left;"> Num.Obs.
</td> <td style="text-align:center;"> 27 </td> <td style="text-align:center;"> 27 </td> </tr>
<tr> <td style="text-align:left;"> Std.Errors </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> by: county </td> </tr>
</tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>

---

# Fixed effects: limitations

1. Fixed effects don't control for anything that has **within** variation
1. They control away everything that's **between** only, so we can't see the effect of anything that's between only (effect of geography on crime rate? nope)
1. Anything with only a **little within** variation will have most of its variation washed out too (effect of population density on crime rate? probably not)
1. If there’s not a lot of within variation, fixed effects are going to be very noisy. Make sure there’s variation to study
1. The estimate pays the most attention to individuals with **lots of variation in treatment**

- Limitations 2 and 3 can be addressed by using the RE estimator instead (although you need to be certain that `\(\text{cov} (X_i, \eta_i) = 0\)`)
- How can you check that?

---

# Fixed or random effects?

- To decide between FE and RE estimators you can run the **Hausman test**, where the null hypothesis is that the RE estimator is the preferred model and the alternative is that the FE estimator is
- It basically tests whether the individual effects `\(\eta_i\)` are correlated with the regressors; the null hypothesis is that they are not
  - under `\(H_0\)`: if `\(\text{cov} (X_i, \eta_i) = 0\)` both RE and FE estimators are consistent, but the RE estimator is more efficient
  - under `\(H_1\)`: if `\(\text{cov} (X_i, \eta_i) \neq 0\)` only the FE estimator is consistent

---

# Fixed or random effects?
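
The Hausman comparison can be written as a single statistic that contrasts the two sets of estimates directly. This is a sketch in notation that is assumed here rather than taken from the slides: `\(\hat{\beta}_{FE}\)` and `\(\hat{\beta}_{RE}\)` are the FE and RE estimates of the coefficients that both models can estimate, `\(\widehat{\text{Var}}(\cdot)\)` are their estimated covariance matrices, and `\(k\)` is the number of such coefficients.

`\begin{align*} H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[ \widehat{\text{Var}}(\hat{\beta}_{FE}) - \widehat{\text{Var}}(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \overset{a}{\sim} \chi^2_k \quad \text{under } H_0 \end{align*}`

A large `\(H\)` (small p-value) means the two estimators disagree by more than sampling noise would explain, which points to `\(\text{cov} (X_i, \eta_i) \neq 0\)` and therefore to the FE estimator.
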
- Let's apply it to two simulated datasets with and without correlated individual effects

```r
phtest(fixed, random)
```

```
##
##  Hausman Test
##
## data: y ~ x1 + x2
## chisq = 0.0018287, df = 1, p-value = 0.9659
## alternative hypothesis: one model is inconsistent
```

```r
phtest(fixed_corr, random_corr)
```

```
##
##  Hausman Test
##
## data: y ~ x1 + x2
## chisq = 71554, df = 1, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
```

- As expected, the test does not reject the null in the first dataset (p-value = 0.9659), so we can use the RE estimator there, and it rejects the null in the second dataset (p-value < 2.2e-16), so we should use the FE estimator

---

# Panel data: inference

- It’s common to cluster standard errors at the level of the fixed effects, since it seems likely that errors would be correlated within an individual over time
  - this is the default behaviour of `feols` in `fixest`
- It’s possible to have more than one set of fixed effects
  - but interpretation gets tricky
  - think through what variation in `\(X\)` you’re looking at (we will discuss that in the `\(5^{\text{th}}\)` tutorial on difference-in-differences design)

---

# References

Books

- Huntington-Klein, N. The Effect: An Introduction to Research Design and Causality, [Chapter 16: Fixed Effects](https://theeffectbook.net/ch-FixedEffects.html)
- Cunningham, S. Causal Inference: The Mixtape, [Chapter 7: Panel Data](https://mixtape.scunning.com/panel-data.html)

Slides

- Huntington-Klein, N. Econometrics Course, [Week 6: Within Variation and Fixed Effects](https://github.com/NickCH-K/EconometricsSlides/blob/master/Week_06/Week_06_1_Within_Variation_and_Fixed_Effects.html)
- Huntington-Klein, N. Causality Inference Course, [Lecture 8: Fixed Effects](https://github.com/NickCH-K/CausalitySlides/blob/main/Lecture_08_Fixed_Effects.html)